Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric

نویسندگان

  • Matthew G. Snover
  • Nitin Madnani
  • Bonnie J. Dorr
  • Richard M. Schwartz
چکیده

Automatic Machine Translation (MT) evaluation metrics have traditionally been evaluated by the correlation of the scores they assign to MT output with human judgments of translation performance. Different types of human judgments, such as Fluency, Adequacy, and HTER, measure varying aspects of MT performance that can be captured by automatic MT metrics. We explore these differences through the use of a new tunable MT metric: TER-Plus, which extends the Translation Edit Rate evaluation metric with tunable parameters and the incorporation of morphology, synonymy and paraphrases. TER-Plus was shown to be one of the top metrics in NIST’s Metrics MATR 2008 Challenge, having the highest average rank in terms of Pearson and Spearman correlation. Optimizing TER-Plus to different types of human judgments yields significantly improved correlations and meaningful changes in the weight of different types of edits, demonstrating significant differences between the types of human judgments.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles

We introduce a novel semi-automated metric, MEANT, that assesses translation utility by matching semantic role fillers, producing scores that correlate with human judgment as well as HTER but at much lower labor cost. As machine translation systems improve in lexical choice and fluency, the shortcomings of widespread n-gram based, fluency-oriented MT evaluation metrics such as BLEU, which fail ...

متن کامل

On the reliability and inter-annotator agreement of human semantic MT evaluation via HMEANT

We present analyses showing that HMEANT is a reliable, accurate and fine-grained semantic frame based human MT evaluation metric with high inter-annotator agreement (IAA) and correlation with human adequacy judgments, despite only requiring a minimal training of about 15 minutes for lay annotators. Previous work shows that the IAA on the semantic role labeling (SRL) subtask within HMEANT is ove...

متن کامل

Automatic Syntactic MT Evaluation with Expected Dependency Pair Match

Previous work on dependency-based machine translation evaluation has shown that an LFGbased F-measure of labeled partial dependencies over an n-best list has improved correlation with human judgments of fluency and adequacy as compared to BLEU and TER. Inspired by SPARSEVAL, we demonstrate that a statistical syntactic parser based on PCFGs may be used in place of the LFG dependencies, and we ex...

متن کامل

MEANT at WMT 2013: A Tunable, Accurate yet Inexpensive Semantic Frame Based MT Evaluation Metric

The linguistically transparentMEANT and UMEANT metrics are tunable, simple yet highly effective, fully automatic approximation to the human HMEANT MT evaluation metric which measures semantic frame similarity between MT output and reference translations. In this paper, we describe HKUST’s submission to the WMT 2013 metrics evaluation task, MEANT and UMEANT. MEANT is optimized by tuning a small ...

متن کامل

Translation Adequacy and Preference Evaluation Tool (TAP-ET)

Evaluation of Machine Translation (MT) technology is often tied to the requirement for tedious manual judgments of translation quality. While automated MT metrology continues to be an active area of research, a well known and often accepted standard metric is the manual human assessment of adequacy and fluency. There are several software packages (RWTH, 2000) (LDC, 2005) that have been used to ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009